{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 16\n",
    "---\n",
    "# Logistic Regression\n",
    "\n",
    "Despire being called a regression, logistic regression is actually a widely used supervised classification technique. \n",
    "Allows us to predict the probability that an observation is of a certain class\n",
    "\n",
    "## 16.1 Training a Binary Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "model.predict: [1]\n",
      "model.predict_proba: [[0.18823041 0.81176959]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data[:100,:]\n",
    "target = iris.target[:100]\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "logistic_regression = LogisticRegression(random_state=0)\n",
    "model = logistic_regression.fit(features_standardized, target)\n",
    "\n",
    "new_observation = [[.5, .5, .5, .5]]\n",
    "\n",
    "print(\"model.predict: {}\".format(model.predict(new_observation)))\n",
    "print(\"model.predict_proba: {}\".format(model.predict_proba(new_observation)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "Dispire having \"regression\" in its name, a logistic regression is actually a widely used binary lassifier (i.e. the target vector can only take two values). In a logistic regression, a linear model (e.g. $\\beta_0 + \\beta_i x$) is included in a logistic (also called sigmoid) function, $\\frac{1}{1+e^{-z }}$, such that:\n",
    "$$\n",
    "P(y_i = 1 | X) = \\frac{1}{1+e^{-(\\beta_0 + \\beta_1x)}}\n",
    "$$\n",
    "where $P(y_i = 1 | X)$ is the probability of the ith obsevation's target, $y_i$ being class 1, X is the training data, $\\beta_0$ and $\\beta_1$ are the parameters to be learned, and e is Euler's number. The effect of the logistic function is to constrain the value of the function's output to between 0 and 1 so that i can be interpreted as a probability. If $P(y_i = 1 | X)$ is greater than 0.5, class 1 is predicted; otherwise class 0 is predicted\n",
    "\n",
    "## 16.2 Training a Multiclass Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "target = iris.target\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "logistic_regression = LogisticRegression(random_state=0, multi_class=\"ovr\")\n",
    "#logistic_regression_MNL = LogisticRegression(random_state=0, multi_class=\"multinomial\")\n",
    "\n",
    "model = logistic_regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. First, in one-vs-rest logistic regression (OVR) a separate model is trained for each class predicted whether an observation is that class or not (thus making it a binary classification problem). It assumes that each observation problem (e.g. class 0 or not) is independent\n",
    "\n",
    "Alternatively in multinomial logistic regression (MLR) the logistic function we saw in Recipe 15.1 is replaced with a softmax function:\n",
    "$$\n",
    "P(y_I = k | X) = \\frac{e^{\\beta_k x_i}}{\\sum_{j=1}^{K}{e^{\\beta_j x_i}}}\n",
    "$$\n",
    "where $P(y_i = k | X)$ is the probability of the ith observation's target value, $y_i$, is class k, and K is the total number of classes. One practical advantage of the MLR is that its predicted probabilities using `predict_proba` method are more reliable\n",
    "\n",
    "We can switch to an MNL by setting `multi_class='multinomial'`\n",
    "\n",
    "## 16.3 Reducing Variance Through Regularization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "target = iris.target\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "logistic_regression = LogisticRegressionCV(\n",
    "    penalty='l2', Cs=10, random_state=0, n_jobs=-1)\n",
    "\n",
    "model = logistic_regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize typically the L1 and L2 penalties\n",
    "\n",
    "In the L1 penalty:\n",
    "$$\n",
    "\\alpha \\sum_{j=1}^{p}{|\\hat\\beta_j|}\n",
    "$$\n",
    "where $\\hat\\beta_j$ is the parameters of the jth of p features being learned and $\\alpha$ is a hyperparameter denoting the regularization strength.\n",
    "\n",
    "With the L2 penalty:\n",
    "$$\n",
    "\\alpha \\sum_{j=1}^{p}{\\hat\\beta_j^2}\n",
    "$$\n",
    "higher values of $\\alpha$ increase the penalty for larger parameter values(i.e. more complex models). scikit-learn follows the common method of using C instead of $\\alpha$ where C is the inverse of the regularization strength: $C = \\frac{1}{\\alpha}$. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find thevalue of C that creates the best model. In scikit-learn we can use the `LogisticRegressionCV` class to efficiently tune C.\n",
    "\n",
    "## 16.4 Training a Classifier on Very Large Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "target = iris.target\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "logistic_regression = LogisticRegression(random_state=0, solver=\"sag\") # stochastic average gradient (SAG) solver\n",
    "model = logistic_regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "scikit-learn's `LogisticRegression` offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us we cannot do something with that solver.\n",
    "\n",
    "Stochastic averge gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important\n",
    "\n",
    "### See Also\n",
    "* Minimizing Finite Sums with the Stochastic Average Gradient Algorithm, Mark Schmidt (http://www.birs.ca/workshops/2014/14w5003/files/schmidt.pdf)\n",
    "\n",
    "## 16.5 Handling Imbalanced Classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data[40:, :]\n",
    "target = iris.target[40:]\n",
    "\n",
    "target = np.where((target == 0), 0, 1)\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "logistic_regression = LogisticRegression(random_state=0, class_weight=\"balanced\")\n",
    "model = logistic_regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "`LogisticRegression` comes with a built in method of handling imbalanced classes.\n",
    "`class_weight=\"balanced\"` will automatically weigh classes inversely proportional to their frequency:\n",
    "$$\n",
    "w_j = \\frac{n}{kn_j}\n",
    "$$\n",
    "where $w_j$ is the weight to class j, n is the number of observations, $n_j$ is the number of observations in class j, and k is the total number of classes"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine_learning_cookbook]",
   "language": "python",
   "name": "conda-env-machine_learning_cookbook-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
